Georgia Institute of Technology - EpiDetector
VAST 2010 Challenge
Hospitalization Records – Characterization of Pandemic Spread
Authors and Affiliations:
Jaeyeon Kihm, Georgia Institute of Technology, jkihm3@gatech.edu [PRIMARY contact]
Jaegul Choo, Georgia Institute of Technology, joyfull@cc.gatech.edu
Carsten Gorg, Georgia Institute of Technology, goerg@cc.gatech.edu
Hanseung Lee, Georgia Institute of Technology, hanseung.lee@gatech.edu
Zhicheng Liu, Georgia Institute of Technology, zliu6@cc.gatech.edu
Heasun Park, Georgia Institute of Technology, hpark@cc.gatech.edu
John Stasko, Georgia Institute of Technology, stasko@cc.gatech.edu
Tool(s):
The visual analytics tool for the VAST 2010 mini challenge 2, EpiDetector, visualizes the overall hospitalization records across cities involved in the epidemic. We developed this tool here at the Georgia Institute of Technology for the challenge. It enables users to analyze an epidemic outbreak across cities as well as the syndromes causing deaths. This application has three different interactive views:
· Main frame: The user can select either a city/location or a syndrome for filtering the data relevant to the selection.
· Overview of hospitalization records: An upper graph shows a timeline with the number of admitted patients per date or the number of patients who were admitted and died on a date, along with the associated medical syndrome. A lower graph shows mortalities and the number of days from the patient’s admittance to their death.
· Syndrome composition view: This window shows the composition of syndromes and the list of symptoms for a set of mortalities.
NOTE: In this project, we refer to each individual patient’s complaint, as listed in the given data files, as a “symptom.” We aggregate similar symptoms into eight primary “syndromes.”
For example, “vomiting, diarrhea” is a “symptom,” whereas “gastrointestinal” is one of the eight syndromes we used in this project.
Video:
ANSWERS:
MC2.1: Analyze the
records you have been given to characterize the spread of the disease.
You should take into consideration symptoms of the disease, mortality rates,
temporal patterns of the onset, peak and recovery of the disease. Health
officials hope that whatever tools are developed to analyze this data might be
available for the next epidemic outbreak. They are looking for
visualization tools that will save them analysis time so they can react
quickly.
1. Method
The hospitalization records include the admission date, age,
gender, patient id, and symptoms for each patient; the death records include
the death date and patient id. Since characterizing the spread of the disease is the
main focus of this challenge, describing which syndrome occurred where and when
is of vital importance. First, we needed to clean the noisy, free-text symptoms,
which contain many abbreviations and misspelled words. We decided to take all
the different symptoms and classify them into the 8 syndromes monitored by the
Real-time Outbreak and Disease Surveillance (RODS) system at the University of Pittsburgh.
We felt that the 8 simple syndromes would be more useful for recognizing the
epidemic outbreak than the full range of individual symptoms.
We used three different approaches to pre-process the
free-text symptoms. First, we separated consecutive symptoms that
had been fused into one word, such as “diarrheavomiting”. To do this, we
iterated through each symptom, one letter at a time, checking whether the string
so far was a valid term in a medical dictionary. Second, we detected duplicated
words like “headacheheadache” using a similar idea, but discarding the duplicates.
Finally, we expanded abbreviations such as “ab” or “abd” for “abdomen” by
checking known lists. With this preprocessing, many ambiguous symptoms were
cleaned.
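The three preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the code used in EpiDetector: the tiny `MEDICAL_TERMS` set and `ABBREVIATIONS` table are hypothetical stand-ins for the medical dictionary and the known abbreviation lists, and the function names are our own.

```python
import re

# Hypothetical stand-ins for the medical dictionary and abbreviation lists.
MEDICAL_TERMS = {"diarrhea", "vomiting", "headache", "fever", "abdomen", "pain"}
ABBREVIATIONS = {"ab": "abdomen", "abd": "abdomen"}


def split_terms(token, dictionary=MEDICAL_TERMS):
    """Split a fused token into dictionary terms, scanning one letter at a time."""
    terms, start = [], 0
    for end in range(1, len(token) + 1):
        if token[start:end] in dictionary:
            terms.append(token[start:end])
            start = end
    if start != len(token):
        # Some letters matched no dictionary term: keep the token unchanged.
        return [token]
    return terms


def clean_symptom(text):
    """Expand abbreviations, split fused words, and drop repeated terms."""
    cleaned = []
    for token in re.findall(r"[a-z]+", text.lower()):
        token = ABBREVIATIONS.get(token, token)
        for term in split_terms(token):
            if term not in cleaned:  # discard duplicates, keep first occurrence
                cleaned.append(term)
    return cleaned
```

Note that this scan accepts the shortest matching prefix first, so a dictionary containing very short terms could split a longer word incorrectly; a real implementation would likely need a longest-match or backtracking strategy.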
Next we created a rule-based classification using RODS syndromic definitions to determine which symptoms belonged to which syndromes.
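A rule-based classifier of this kind might look like the sketch below. It is illustrative only: the keyword sets are hypothetical placeholders rather than the actual RODS syndromic definitions, the category names other than “gastrointestinal” and “other” are assumptions, and the first-match-wins ordering is a design choice made for this example.

```python
# Illustrative keyword rules only; NOT the actual RODS syndromic definitions.
# Rules are checked in order, and the first matching syndrome wins.
SYNDROME_RULES = [
    ("gastrointestinal", {"vomiting", "diarrhea", "abdomen", "nausea"}),
    ("respiratory", {"cough", "wheezing", "pneumonia"}),
    ("neurological", {"headache", "seizure", "dizziness"}),
    ("rash", {"rash", "hives"}),
    ("hemorrhagic", {"bleeding", "nosebleed"}),
    ("constitutional", {"fever", "chills", "fatigue"}),
    ("botulinic", {"diplopia", "dysphagia"}),
]


def classify(terms):
    """Return the first syndrome whose keywords overlap the cleaned terms."""
    term_set = set(terms)
    for syndrome, keywords in SYNDROME_RULES:
        if term_set & keywords:
            return syndrome
    # Anything that matches no rule falls into the catch-all category.
    return "other"
```

With these placeholder rules, a record cleaned to `["vomiting", "diarrhea"]` is labeled “gastrointestinal”, and a record with no matching keyword is labeled “other”.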
Once the data was cleaned and the symptoms classified
into the 8 categories, we built a visualization tool to read the data and show
the results. Figure 1 shows a screenshot
from the application. The top graph
shows overall patient admittances to a hospital, and the lower region shows the
mortalities.
2. Mortality rates
We analyzed the mortality of each syndrome by
showing the number of patients who died (Figure 1). We created a bar chart
in which each person is positioned on the day that they entered the hospital (on
the x-axis) and is colored by the number of days they were in the hospital
before dying. Figure 1 shows this pattern for Iran, and one can clearly see the
rise in the number of deaths in the middle of the period. Most of the bars are composed of
large red regions, which correspond to patients being in the hospital for
eight days before passing away. To see which of the syndromes is most closely connected
to the epidemic, we compared the pattern of each syndrome in the upper and
lower graphs of Figure 1. Since the pattern of the gastrointestinal syndrome in the
upper graph is most similar to the trend of the epidemic in the lower graph, we
suspected that the gastrointestinal syndrome and the epidemic might be strongly
correlated. Specifically, we could see that patients with the
gastrointestinal syndrome are dominant among the patients who died in eight days
on May 18, 2009 (Figure 2). Moreover, we could check the most frequent symptoms
in the death records: vomiting (265 occurrences), pain (161), abdomen (122),
diarrhea (117), and fever (81) (Figure 2). The
same pattern of death records was also found in the other cities except for Turkey
and Thailand (Figure 3), so we expect that the epidemic is infectious and
may have been transmitted to all cities except those two locations.
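The data behind the lower bar chart can be derived by joining the admission and death records on patient id. The following is a minimal sketch under assumed data structures (two dictionaries mapping a patient id to a `date`); the function name and input layout are hypothetical and do not reflect the challenge file format.

```python
from collections import Counter
from datetime import date


def admission_to_death_counts(admissions, deaths):
    """Count deaths per (admission date, days in hospital before dying).

    admissions: dict of patient id -> admission date (assumed layout)
    deaths:     dict of patient id -> death date (assumed layout)
    """
    counts = Counter()
    for patient_id, died_on in deaths.items():
        admitted_on = admissions.get(patient_id)
        if admitted_on is None:
            continue  # death record without a matching admission record
        stay_days = (died_on - admitted_on).days
        counts[(admitted_on, stay_days)] += 1
    return counts


# Made-up example records: two patients admitted on May 10 die eight days later.
admissions = {1: date(2009, 5, 10), 2: date(2009, 5, 10), 3: date(2009, 5, 10)}
deaths = {1: date(2009, 5, 18), 3: date(2009, 5, 18)}
bars = admission_to_death_counts(admissions, deaths)
```

Each key corresponds to one colored segment of a bar: the admission date gives the x-axis position and the stay length gives the color.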
3. Outbreak Pattern across cities
We found that the disease typically kills patients about 8 days after hospital admission. By analyzing the pattern of the epidemic outbreak in each city, we could see the outbreak pattern across cities (Table 1).
| Location | Onset | Peak | Recovery | Outbreak duration |
|---|---|---|---|---|
| Nairobi | April 20 | May 14 | June 16 | 58 days |
| Lebanon | April 22 | May 16 | June 18 | 58 days |
| Venezuela | April 22 | May 18 | June 19 | 59 days |
| Aleppo | April 24 | May 15 | June 17 | 55 days |
| Yemen | April 24 | May 17 | June 18 | 56 days |
| Karachi | April 24 | May 17 | June 18 | 56 days |
| Iran | April 24 | May 18 | June 20 | 58 days |
| Saudi Arabia | April 25 | May 18 | June 20 | 57 days |
| Colombia | April 26 | May 20 | June 19 | 55 days |
Table 1. Key timings and statistics about the disease outbreak in different locations.
We set the onset to be the date of the first suspected death, the peak to be the day with the most suspected deaths, and the recovery to be the last day on which a suspected death occurred. The outbreak duration counts both the onset and the recovery day. Locations are sorted in the table by onset date.
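The definitions above translate directly into a small computation over the suspected death dates of one location. The sketch below is illustrative: the function name is hypothetical, the input is assumed to be a non-empty list of `date` objects, and breaking ties for the peak by the earliest date is an assumption.

```python
from collections import Counter
from datetime import date


def outbreak_timing(death_dates):
    """Derive onset, peak, recovery, and duration from suspected death dates.

    death_dates must be a non-empty iterable of datetime.date objects.
    """
    counts = Counter(death_dates)
    onset = min(counts)      # date of the first suspected death
    recovery = max(counts)   # date of the last suspected death
    # Day with the most suspected deaths; ties go to the earliest date (assumption).
    peak = min(counts, key=lambda d: (-counts[d], d))
    # Inclusive count: both the onset day and the recovery day are included.
    duration_days = (recovery - onset).days + 1
    return {"onset": onset, "peak": peak, "recovery": recovery,
            "duration_days": duration_days}
```

For example, an onset of April 20 and a recovery of June 16 give an inclusive duration of 58 days, which matches the Nairobi row of Table 1.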
The epidemic appeared to begin in Nairobi, Kenya with early onset also in Venezuela and Lebanon. It quickly spread to Syria, Yemen, Pakistan, Iran, Saudi Arabia, and Colombia. One might expect that neighboring countries would soon be at risk.
MC2.2:
Compare the outbreak across cities. Factors to consider include timing of
outbreaks, numbers of people infected and recovery ability of the individual
cities. Identify any anomalies you found.
In MC2.1, we characterized the timing of the epidemic in different cities. The onsets of the outbreaks were very close, spread over just a few days. All cities also had very similar recovery and duration times, except, of course, for Turkey and Thailand. Aleppo, Syria and Colombia seemed to recover slightly faster than the other cities. Using our application, we could get the number of people infected and a mortality rate by checking, over all dates, the number of deaths and the total number of patients in each city. Table 2 shows the results and indicates that Saudi Arabia had the lowest mortality rate and Aleppo the highest. Referring to Table 1, we might also say that Colombia had the best recovery ability, with the shortest outbreak duration (55 days, tied with Aleppo), even though its death rate in Table 2 was not especially low. Again, Thailand and Turkey do not show the pattern of patients dying in 8 days, and the main cause of death in these locations was the syndrome labeled “other”.
| Location | Death rates (the # of deaths / the # of patients) | Death rates of people infected (the # of deaths infected / the # of patients) | Number of deaths infected |
|---|---|---|---|
| Aleppo | 3.51% | 3.44% | 78672 |
| Colombia | 2.32% | 2.23% | 16338 |
| Iran | 2.20% | 2.12% | 11926 |
| Karachi | 2.31% | 2.24% | 165605 |
| Lebanon | 1.73% | 1.66% | 7646 |
| Nairobi | 3.40% | 3.34% | 43959 |
| Saudi Arabia | 1.62% | 1.54% | 21529 |
| Venezuela | 2.27% | 2.20% | 3717 |
| Yemen | 2.56% | 2.48% | 7711 |
Table 2. The number of deaths infected and the death rate of each city
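The two rates in Table 2 follow the ratios given in its column headers. A minimal sketch of that arithmetic is shown below; the function name is hypothetical, and it assumes that “deaths infected” means deaths attributed to the epidemic. The counts in the test values are made up for illustration and are not taken from the challenge data.

```python
def death_rates(total_patients, total_deaths, infected_deaths):
    """Return (overall death rate, death rate of people infected) in percent.

    Both rates use the total number of patients as the denominator,
    following the column headers of Table 2.
    """
    if total_patients <= 0:
        raise ValueError("total_patients must be positive")
    overall = 100.0 * total_deaths / total_patients
    infected = 100.0 * infected_deaths / total_patients
    return overall, infected
```

For instance, 35 deaths among 1000 patients, 34 of them attributed to the epidemic, would give rates of 3.5% and 3.4%.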
Figures:
Figure 1. Location overview of hospitalization records - Iran
Figure 2. Syndrome composition view
Figure 3. Location overview of hospitalization records - Turkey